Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import types as T

from pyspark.sql import functions as F

from datetime import datetime
from decimal import Decimal

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.1 - Looking at Your Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
  id breed_id nickname             birthday  age  color
0  1        1     King  2014-11-22 12:30:31    5  brown
1  2        3    Argus  2016-11-22 10:05:10   10   None
2  3        1   Chewie  2016-11-22 10:05:10   15   None

Selecting a Subset of Columns

When you're working with raw data, you are usually only interested in a subset of columns. This means you should get into the habit of selecting only the columns you need before any Spark transformations, as in the sketch below.
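For example, a minimal sketch of this habit using the pets data above: select the needed columns first, then apply the rest of your transformations to the narrower dataframe (the filter is just for illustration).

(
    pets
    .select("id", "nickname", "color")
    .where(F.col("color") == "brown")
    .toPandas()
)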

Why?

If you do not, and you are working with a wide dataset, this will cause your Spark application to do more work than it needs to. This is because all of the extra columns will be shuffled between the workers during the execution of the transformations.

This is especially costly if you have string columns that contain large amounts of text.

Note

Spark is sometimes smart enough to know which columns aren't being used and perform a projection pushdown (column pruning) to drop the unneeded columns, but it's better practice to do the selection explicitly first.
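If you want to check whether the pushdown is happening, you can inspect the physical plan with explain(); a minimal sketch using the pets data above:

# The scan's ReadSchema should list only the columns that survive the select,
# if Spark was able to prune the rest.
(
    pets
    .select("id", "nickname")
    .explain()
)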

Option 1 - select()

(
    pets
    .select("id", "nickname", "color")
    .toPandas()
)
  id nickname  color
0  1     King  brown
1  2    Argus   None
2  3   Chewie   None

What Happened?

Similar to a SQL SELECT statement, it will only keep the columns specified in the arguments in the resulting df. A list object can be passed as the argument as well.

If you have a wide dataset and only want to work with a small number of columns, a select will take fewer lines of code.

Note

If the argument * is provided, all the columns will be selected.
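For example, both of these are valid ways of calling select(), and "*" behaves like SELECT * in SQL (a small sketch using the pets data above):

# Passing a list of column names works the same as passing them individually.
cols = ["id", "nickname", "color"]
pets.select(cols).toPandas()

# "*" keeps every column.
pets.select("*").toPandas()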

Option 2 - drop()

(
    pets
    .drop("breed_id", "birthday", "age")
    .toPandas()
)
  id nickname  color
0  1     King  brown
1  2    Argus   None
2  3   Chewie   None

What Happened?

This is the opposite of a select statement; it will drop all of the columns specified.

If you have a wide dataset and need the majority of the columns, a drop will take fewer lines of code.
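drop() takes the column names as separate arguments, so if you keep them in a list you can unpack it; a small sketch (the list name is just for illustration):

# Unpack the list with * because drop() expects individual column names.
cols_to_drop = ["breed_id", "birthday", "age"]
pets.drop(*cols_to_drop).toPandas()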

Summary

  • Work with only the subset of columns required for your Spark application; there is no need to do extra work.
  • Depending on the number of columns you are going to work with, either a select or a drop will be more concise; pick whichever is shorter.
